
    An efficient algorithm for the extended (l,d)-motif problem with unknown number of binding sites

    Finding common patterns, or motifs, in a set of DNA sequences is an important problem in molecular biology. Most motif-discovery algorithms/software require the length of the motif as input. Motivated by the fact that the motif's length is usually unknown in practice, Styczynski et al. introduced the Extended (l,d)-Motif Problem (EMP), where the motif's length is not an input parameter. Unfortunately, the algorithm given by Styczynski et al. to solve EMP can take an unacceptably long time to run, e.g. over 3 months to discover a length-14 motif. This paper makes two main contributions. First, we eliminate another input parameter from EMP: the minimum number of binding sites in the DNA sequences. Fewer input parameters not only reduce the burden on the user, but may also give more realistic/robust results, since restrictions on the length or on the number of binding sites make little sense when the best motif may be neither the longest nor the one with the most binding sites. Second, we develop an efficient algorithm to solve our redefined problem. The algorithm is also a fast solution for EMP (without any sacrifice in accuracy), making EMP practical. © 2005 IEEE.
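    A minimal sketch of the occurrence test that underlies (l,d)-motif search, assuming helper names (hamming, count_binding_sites) chosen here purely for illustration; it is not the paper's algorithm. A length-l candidate occurs in a sequence if some length-l window is within Hamming distance d of it, and the number of sequences containing such a window approximates the number of binding sites.

    def hamming(a: str, b: str) -> int:
        # Number of mismatching positions between two equal-length strings.
        return sum(x != y for x, y in zip(a, b))

    def count_binding_sites(motif: str, sequences: list[str], d: int) -> int:
        # Count sequences that contain a window within Hamming distance d of the motif.
        l = len(motif)
        return sum(
            any(hamming(motif, seq[i:i + l]) <= d for i in range(len(seq) - l + 1))
            for seq in sequences
        )

    # Example: ACGT occurs with at most 1 mismatch in both sequences below.
    print(count_binding_sites("ACGT", ["TTACGATT", "GGACGTGG"], d=1))  # 2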

    Finding exact optimal motifs in matrix representation by partitioning

    Motivation: Finding common patterns, or motifs, in the promoter regions of co-expressed genes is an important problem in bioinformatics. A common representation of a motif is a probability matrix or PSSM (position-specific scoring matrix). However, even for a motif of length six or seven, there is no algorithm that can guarantee finding the exact optimal matrix from an infinite number of possible matrices. Results: This paper introduces the first algorithm, called EOMM, for finding the exact optimal matrix-represented motif, or simply the optimal motif. Based on branch-and-bound searching that partitions the solution space recursively, EOMM can find the optimal motif of size up to eight or nine, and a motif of larger size with any desired accuracy, on the principle that the smaller the error bound, the longer the running time. Experiments show that for some real and simulated datasets, EOMM finds the motif despite very weak signals when existing software, such as MEME and MITRA-PSSM, fails to do so. © The Author 2005. Published by Oxford University Press. All rights reserved.
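    As background for the matrix representation mentioned above, the following is a minimal, hedged sketch of how a PSSM can be built from site counts and used to score windows of a sequence; the function names (pssm_from_counts, best_site) and the pseudocount setting are illustrative assumptions, not part of EOMM.

    import math

    BASES = "ACGT"

    def pssm_from_counts(counts, background=0.25, pseudo=0.5):
        """counts: list of {base: count} dicts, one per motif position."""
        matrix = []
        for col in counts:
            total = sum(col.values()) + pseudo * 4
            # Log-odds of observing each base at this position versus background.
            matrix.append({b: math.log2((col.get(b, 0) + pseudo) / total / background)
                           for b in BASES})
        return matrix

    def score(window, matrix):
        return sum(col[b] for b, col in zip(window, matrix))

    def best_site(seq, matrix):
        l = len(matrix)
        return max((score(seq[i:i + l], matrix), i)
                   for i in range(len(seq) - l + 1))

    matrix = pssm_from_counts([{"A": 8, "C": 1}, {"C": 9}, {"G": 7, "T": 2}])
    print(best_site("TTACGTT", matrix))  # prints (score, offset) of the best length-3 window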

    Non-adaptive complex group testing with multiple positive sets

    LNCS v. 6648 is the conference proceedings of TAMC 2011.
    Given n items, at most d of which have a particular property (referred to as positive items), a single test on a selected subset of the items is positive if the subset contains any positive item. The non-adaptive group testing problem is to design how to group the items so as to minimize the number of tests required to identify all positive items, with all tests performed in parallel. This problem is well studied, and algorithms exist that match the lower bound asymptotically up to a small gap of log d. An important generalization considers the case in which no individual positive item can make a test positive, but certain combinations of items (referred to as positive subsets) can. This problem is referred to as non-adaptive complex group testing. Assuming there are at most d positive subsets, each of size at most s, existing algorithms either require Ω(log^s n) tests for general n or O((s+d choose d) · log n) tests for only some special values of n. However, in real situations the number of items in each test cannot be very small or very large, and the above algorithms cannot be applied because they provide no control over the number of items in each test. In this paper, we provide a novel and practical derandomized algorithm to construct the tests, which has two important properties. (1) Our algorithm requires only O((d+s)^(d+s+1) / (d^d · s^s) · log n) tests for all positive integers n, which matches the upper bound on the number of tests when all positive subsets are singletons, i.e. s = 1. (2) All tests in our algorithm can have the same number k of tested items. Thus, our algorithm can solve the problem with additional constraints on the number of tested items in each test, such as a maximum or minimum number of tested items. © 2011 Springer-Verlag.
    The 8th Annual Conference on Theory and Applications of Models of Computation (TAMC 2011), Tokyo, Japan, 23-25 May 2011. In Lecture Notes in Computer Science, 2011, v. 6648, p. 172-18
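    For contrast with the complex setting above, the sketch below shows the standard decoding rule for plain non-adaptive group testing with single positive items (s = 1): with a suitable (d-disjunct) test design, any item appearing in a negative test is ruled out. This is illustrative background only and is not the paper's derandomized construction for complex group testing.

    from typing import List, Set

    def decode(tests: List[Set[int]], outcomes: List[bool], n: int) -> Set[int]:
        """tests[i] is the set of item indices pooled in test i; outcomes[i] is its result."""
        candidates = set(range(n))
        for pool, positive in zip(tests, outcomes):
            if not positive:
                candidates -= pool          # items in a negative test cannot be positive
        return candidates

    # Example: 5 items, item 3 is the only positive one.
    tests = [{0, 1}, {2, 3}, {0, 3}, {1, 4}, {2, 4}]
    outcomes = [False, True, True, False, False]
    print(decode(tests, outcomes, n=5))     # {3}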

    DMPFinder - Finding differentiating pathways with gaps from two groups of metabolic networks

    Why some strains of a species exhibit a certain phenotype (e.g. drug resistance) while other strains of the same species do not is a critical question to answer. Studying the metabolism of the two groups of strains may reveal the corresponding pathways that are conserved in the first group but not in the second. However, only a few tools provide functions to compare two groups of metabolic networks, and these are usually limited to the reaction level rather than the pathway level. In this paper, we formulate the DMP (Differentiating Metabolic Pathway) problem of finding conserved pathways that exist in the first group but not in the second. The formulation also captures mutations in pathways and derives measures (a p-value and an e-score) for evaluating the confidence of the pathways. We then develop an algorithm, DMPFinder, to solve the DMP problem. Experimental results show that DMPFinder is able to identify pathways that are critical for the first group to exhibit a certain phenotype that is absent in the other group. Some of these pathways cannot be identified by other tools, which only consider the reaction level or do not take into account possible mutations among species. The software is available at: http://i.cs.hku.hk/alse/hkubrg/projects/DMPFinder/
    The 3rd International Conference on Bioinformatics and Computational Biology (BICoB 2011), New Orleans, LA, 23-25 March 2011 (Session 2B: Biological and Regulatory Networks)
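    A toy illustration, under assumptions made here for brevity, of what a differentiating pathway means: a sequence of reactions present as consecutive edges in every network of the first group and in no network of the second. Gaps, mutations, and the p-value/e-score of DMPFinder are not modelled; the helper names are hypothetical.

    def has_pathway(network, pathway):
        """network: dict mapping a reaction to the set of reactions it can feed."""
        return all(b in network.get(a, set()) for a, b in zip(pathway, pathway[1:]))

    def differentiates(pathway, group_a, group_b):
        # Present in every network of group A and absent from every network of group B.
        return (all(has_pathway(net, pathway) for net in group_a)
                and not any(has_pathway(net, pathway) for net in group_b))

    group_a = [{"R1": {"R2"}, "R2": {"R3"}}, {"R1": {"R2", "R4"}, "R2": {"R3"}}]
    group_b = [{"R1": {"R2"}, "R2": {"R5"}}]
    print(differentiates(["R1", "R2", "R3"], group_a, group_b))  # True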

    T-IDBA: A de novo Iterative de Bruijn Graph Assembler for Transcriptome

    LNCS v. 6577 entitled: Research in computational molecular biology: 15th annual international conference, RECOMB 2011 ... : proceedings.
    RNA-seq data produced by next-generation sequencing technology is a useful tool for analyzing transcriptomes. However, existing de novo transcriptome assemblers do not fully utilize the properties of transcriptomes and may produce short contigs because of the splicing nature (shared exons) of the genes. We propose the T-IDBA algorithm to reconstruct expressed isoforms without a reference genome. By using paired-end information to resolve long repeats shared between different genes and branching within the same gene due to alternative splicing, the graph can be decomposed into small components, each of which corresponds to a gene. The most probable isoforms with sufficient support from the paired-end reads are then found heuristically. In practice, our de novo transcriptome assembler, T-IDBA, outperforms ABySS substantially in terms of sensitivity and precision for both simulated and real data. T-IDBA is available at http://www.cs.hku.hk/~alse/tidba/. © 2011 Springer-Verlag.
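    The de Bruijn graph is the core data structure behind T-IDBA (and IDBA-UD below); the following is a minimal construction sketch in which nodes are (k-1)-mers and each k-mer contributes one edge. It is an illustration of the data structure, not the assembler's actual code.

    from collections import defaultdict

    def de_bruijn_graph(reads, k):
        """Nodes are (k-1)-mers; each k-mer in a read adds an edge prefix -> suffix."""
        graph = defaultdict(set)
        for read in reads:
            for i in range(len(read) - k + 1):
                kmer = read[i:i + k]
                graph[kmer[:-1]].add(kmer[1:])
        return graph

    reads = ["ACGTAC", "CGTACG"]
    for node, succs in sorted(de_bruijn_graph(reads, k=4).items()):
        print(node, "->", sorted(succs))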

    MetaCluster 4.0: A novel binning algorithm for NGS reads and huge number of species

    Next-generation sequencing (NGS) technologies allow the sequencing of microbial communities directly from the environment without prior culturing. The output of environmental DNA sequencing consists of many reads from genomes of different unknown species, making the clustering of reads from the same (or similar) species (known as binning) a crucial step. The difficulties of the binning problem are due to four factors: (1) the lack of reference genomes; (2) uneven abundance ratios of species; (3) short NGS reads; and (4) a large number of species (which can be more than a hundred). None of the existing binning tools can handle all four factors. No tool, including AbundanceBin and MetaCluster 3.0, has demonstrated reasonable performance on a sample with more than 20 species. In this article, we introduce MetaCluster 4.0, an unsupervised binning algorithm that can accurately (with about 80% precision and sensitivity in all cases and at least 90% in some cases) and efficiently bin short reads with varying abundance ratios and is able to handle datasets with 100 species. The novelty of MetaCluster 4.0 stems from solving a few important problems: how to divide reads into groups by a probabilistic approach, how to estimate the 4-mer distribution of each group, how to estimate the number of species, and how to modify MetaCluster 3.0 to handle a large number of species. We show that MetaCluster 4.0 is effective for both simulated and real datasets. Supplementary material is available at www.liebertonline.com/cmb. © 2012 Mary Ann Liebert, Inc.
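    The 4-mer distribution mentioned above is a standard composition signature for binning; a minimal sketch of computing a normalized 4-mer frequency vector for a read or contig follows, as an illustration of the feature space rather than MetaCluster's implementation.

    from itertools import product

    ALL_4MERS = ["".join(p) for p in product("ACGT", repeat=4)]  # 256 features

    def fourmer_profile(seq):
        counts = {m: 0 for m in ALL_4MERS}
        for i in range(len(seq) - 3):
            mer = seq[i:i + 4]
            if mer in counts:          # skip windows containing N or other ambiguity codes
                counts[mer] += 1
        total = sum(counts.values()) or 1
        return [counts[m] / total for m in ALL_4MERS]

    profile = fourmer_profile("ACGTACGTNNACGT")
    print(len(profile), round(sum(profile), 6))  # 256 features summing to 1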

    Deterministic Polynomial-Time Algorithms for Designing Short DNA Words


    MetaCluster-TA: taxonomic annotation for metagenomic data based on assembly-assisted binning

    This article is part of the supplement: Selected articles from the Twelfth Asia Pacific Bioinformatics Conference (APBC 2014): Genomics.
    Background: Taxonomic annotation of reads is an important problem in metagenomic analysis. Existing annotation tools, which rely on aligning each read to the taxonomic structure, are unable to annotate many reads efficiently and accurately because reads (100 bp) are short and most of them come from unknown genomes. Previous work has suggested assembling the reads into longer contigs before annotation. More reads/contigs can be annotated, since a longer contig (in Kbp) can be aligned to a taxon even if it comes from an unknown species, as long as it contains a conserved region of that taxon. Unfortunately, existing metagenomic assembly tools are not mature enough to produce sufficiently long contigs. Binning tries to group reads/contigs of similar species together. Intuitively, reads in the same group (cluster) should be annotated to the same taxon, and these reads together should cover a significant portion of the genome, alleviating the problem of short contigs if the quality of binning is high. However, no existing work has tried to use binning results to help solve the annotation problem. This work explores this direction. Results: In this paper, we describe MetaCluster-TA, an assembly-assisted, binning-based annotation tool that relies on the innovative idea of annotating binned reads instead of aligning each read or contig to the taxonomic structure separately. We propose the novel concept of the 'virtual contig' (which can be up to 10 Kb in length) to represent a set of reads, and then represent each cluster as a set of virtual contigs (which together can total up to 1 Mb in length) for annotation. MetaCluster-TA can outperform the widely used MEGAN4 and can annotate (1) more reads, since the virtual contigs are much longer; (2) more accurately, since each cluster of long virtual contigs contains global information about the sampled genome, which tends to be more accurate than short reads or assembled contigs that contain only local information about the genome; and (3) more efficiently, since there are far fewer long virtual contigs to align than short reads. MetaCluster-TA also outperforms MetaCluster 5.0 as a binning tool, since binning itself can be more sensitive and precise given long virtual contigs, and the binning results can be improved using the reference taxonomic database. Conclusions: MetaCluster-TA can outperform the widely used MEGAN4, annotating more reads with higher accuracy and higher efficiency. It also outperforms MetaCluster 5.0 as a binning tool.

    MetaCluster 5.0: a two-round binning approach for metagenomic data for low-abundance species in a noisy sample

    MOTIVATION: Metagenomic binning remains an important topic in metagenomic analysis. Existing unsupervised binning methods for next-generation sequencing (NGS) reads do not perform well on (i) samples with low-abundance species or (ii) samples (even with high abundance) when there are many extremely low-abundance species. These two problems are common in real metagenomic datasets, so binning methods that can solve them are desirable. RESULTS: We propose a two-round binning method (MetaCluster 5.0) that aims at identifying both low-abundance and high-abundance species in the presence of a large amount of noise due to many extremely low-abundance species. In summary, MetaCluster 5.0 uses a filtering strategy to remove noise from the extremely low-abundance species. It separates reads of high-abundance species from those of low-abundance species in two different rounds. To overcome the issue of low coverage for low-abundance species, multiple w values are used to group reads with overlapping w-mers: reads from high-abundance species are grouped with high confidence using a large w, and the binning then expands to low-abundance species using a relaxed (shorter) w. Compared to the recent tools TOSS and MetaCluster 4.0, MetaCluster 5.0 can find more species (especially those with low abundance, say 6x to 10x) and can achieve better sensitivity and specificity using less memory and running time. AVAILABILITY: http://i.cs.hku.hk/alse/MetaCluster/ CONTACT: [email protected]
    The 11th European Conference on Computational Biology (ECCB'12), Basel, Switzerland, 9-12 September 2012. In Bioinformatics, 2012, v. 28 n. 18, p. i356-i36
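    A minimal sketch of the w-mer grouping idea described above: reads sharing a length-w substring are merged into one group via union-find. The helper functions and the single fixed w are simplifying assumptions; MetaCluster 5.0's two rounds and multiple w values are not reproduced here.

    def group_by_shared_wmers(reads, w):
        parent = list(range(len(reads)))

        def find(x):
            while parent[x] != x:
                parent[x] = parent[parent[x]]   # path compression
                x = parent[x]
            return x

        def union(a, b):
            parent[find(a)] = find(b)

        seen = {}                               # w-mer -> first read containing it
        for idx, read in enumerate(reads):
            for i in range(len(read) - w + 1):
                mer = read[i:i + w]
                if mer in seen:
                    union(idx, seen[mer])
                else:
                    seen[mer] = idx

        groups = {}
        for idx in range(len(reads)):
            groups.setdefault(find(idx), []).append(idx)
        return list(groups.values())

    reads = ["ACGTACGT", "TACGTTTT", "GGGCCCCC", "CCCCCAAA"]
    print(group_by_shared_wmers(reads, w=5))    # [[0, 1], [2, 3]]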

    IDBA-UD: a de novo assembler for single-cell and metagenomic sequencing data with highly uneven depth

    MOTIVATION: Next-generation sequencing allows us to sequence reads from a microbial environment using single-cell sequencing or metagenomic sequencing technologies. However, both technologies suffer from the problem that the sequencing depths of different regions of a genome, or of genomes from different species, are highly uneven. Most existing genome assemblers assume that sequencing depth is even and therefore fail to construct correct long contigs. RESULTS: We introduce the IDBA-UD algorithm, based on the de Bruijn graph approach, for assembling reads from single-cell sequencing or metagenomic sequencing technologies with uneven sequencing depths. Several non-trivial techniques are employed to tackle the problems. Instead of using a single threshold, we use multiple depth-relative thresholds to remove erroneous k-mers in both low-depth and high-depth regions. The technique of local assembly with paired-end information is used to solve the branching problem in low-depth short repeat regions. To speed up the process, an error-correction step is conducted to correct reads in high-depth regions that can be aligned to high-confidence contigs. Comparison of the performance of IDBA-UD and existing assemblers (Velvet, Velvet-SC, SOAPdenovo and Meta-IDBA) on different datasets shows that IDBA-UD can reconstruct longer contigs with higher accuracy. AVAILABILITY: The IDBA-UD toolkit is available at our website http://www.cs.hku.hk/alse/idba_ud
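    A minimal sketch of a depth-relative k-mer filter in the spirit described above: rather than one global cutoff, a k-mer is kept only if its count reaches a fraction of a local-depth estimate (here, crudely, the highest k-mer count in its own read). This illustrates the idea only and is not IDBA-UD's actual procedure; the ratio and helper names are assumptions.

    from collections import Counter

    def kmers(read, k):
        return [read[i:i + k] for i in range(len(read) - k + 1)]

    def filter_kmers(reads, k, ratio=0.2):
        counts = Counter(m for r in reads for m in kmers(r, k))
        kept = set()
        for read in reads:
            ms = kmers(read, k)
            local_depth = max(counts[m] for m in ms)   # crude stand-in for local depth
            kept.update(m for m in ms if counts[m] >= ratio * local_depth)
        return kept

    reads = ["ACGTACGT"] * 10 + ["ACGAACGT"]   # the last read carries likely sequencing errors
    print(sorted(filter_kmers(reads, k=4)))    # erroneous k-mers (count 1) are dropped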